Parsing the Arabic Treebank: Analysis and Improvements

Authors

Seth Kulick

Ryan Gabbard

Mitchell Marcus

Abstract

Previous work has demonstrated that the performance of current parsers on Arabic is far below their performance on English or even Chinese, which in turn harms performance on NLP tasks that use parsing as an input. This paper is an exploration of some of the issues involved in this difference. We focus on the Collins parsing model [3] as implemented in the Bikel parser [1]. The corpus used for the experiments is the Arabic Treebank [6] (ATB). We cluster these issues in three ways. First, it is important when comparing Arabic parsing performance to other languages that the comparison be a fair one; therefore we first discuss some issues around evaluation and show that current Arabic parsing performance is not quite as bad as previously thought. Second, we present some modifications to the parser which provide modest increases in performance. Finally, we explore deeper differences between the Arabic Treebank and the Penn Treebank and advance some speculations as to why parsers have difficulty with Arabic.

Similar resources

Enhancing the Arabic Treebank: a Collaborative Effort toward New Annotation Guidelines

The Arabic Treebank team at the Linguistic Data Consortium has significantly revised and enhanced its annotation guidelines and procedure over the past year. Improvements were made to both the morphological and syntactic annotation guidelines, and annotators were trained in the new guidelines, focusing on areas of low inter-annotator agreement. The revised guidelines are now being applied in an...

Turkish Treebank as a Gold Standard for Morphological Disambiguation and Its Influence on Parsing

So far predicted scenarios for Turkish dependency parsing have used a morphological disambiguator that is trained on the data distributed with the tool (Sak et al., 2008). Although models trained on this data have high accuracy scores on the test and development data of the same set, the accuracy drastically drops when the model is used in the preprocessing of Turkish Treebank parsing experiment...

Utilizing State-of-the-art Parsers to Diagnose Problems in Treebank Annotation for a Less Resourced Language

The recent success of statistical parsing methods has made treebanks important resources for building good parsers. However, constructing high-quality annotated treebanks is a challenging task. We utilized two publicly available parsers, Berkeley and MST parsers, for feedback on improving the quality of part-of-speech tagging for the Vietnamese Treebank. Analysis of the treebank and parsi...

Syntactic Analysis of the Tunisian Arabic

In this paper, we study the problem of syntactic analysis of Dialectal Arabic (DA). Corpora are an important resource for the automatic processing of languages. We therefore propose a method for creating a treebank for Tunisian Arabic (TA), the “Tunisian Treebank”, in order to adapt an Arabic parser to handle TA, which is considered a variant of the Arabic language.

Better Arabic Parsing: Baselines, Evaluations, and Analysis

In this paper, we offer broad insight into the underperformance of Arabic constituency parsing by analyzing the interplay of linguistic phenomena, annotation choices, and model design. First, we identify sources of syntactic ambiguity understudied in the existing parsing literature. Second, we show that although the Penn Arabic Treebank is similar to other treebanks in gross statistical terms, ...

Publication date: 2006
